Use 427 tokenizer #9

Merged
JPrevost merged 17 commits into main from use-427-tokenizer
Mar 12, 2026

Conversation

@JPrevost
Member

@JPrevost JPrevost commented Mar 6, 2026

Why are these changes being introduced:

  • This is the core work to put the tokenizer lambda in place, which is a
    critical component of the overall system. It will be used to tokenize
    queries for use in the search process.

Relevant ticket(s):

How does this address that need:

  • This change introduces the tokenizer lambda
  • It includes the code for the lambda itself, as well as tests and
    documentation for how to use it.

Document any side effects to this change:

  • Adjusted some linting rules from the template to allow better
    understanding of what we are skipping and added some skips as
    appropriate (mostly around tests).
  • Leaves the templated my_function in place as a placeholder; it will
    likely be removed in the near future.
  • Introduces files from a Hugging Face model without an automated way
    to update them. Whether this is temporary will depend on how often we
    need to update those files and whether we can find a better way to
    manage them.

Includes new or updated dependencies?

YES

Changes expectations for external applications?

YES

What are the relevant tickets?

  • https://mitlibraries.atlassian.net/browse/USE-427

Code review

  • Code review best practices are documented here and you are encouraged to have a constructive dialogue with your reviewers about their preferences and expectations.

JPrevost added 2 commits March 6, 2026 15:59
- remove dependabot config
- update pre-commit settings to pre-push and remove audit
- add support for log levels and removed `configure_dev_logger()`
- add dotenv support
- update license
- update makefile linting commands, adds console command
- removes template info from readme
Why are these changes being introduced:

* This is the core work to put the tokenizer lambda in place, which is a
  critical component of the overall system. It will be used to tokenize
  queries for use in the search process.

Relevant ticket(s):

* https://mitlibraries.atlassian.net/browse/USE-427

How does this address that need:

* This change introduces the tokenizer lambda
* It includes the code for the lambda itself, as well as tests and
  documentation for how to use it.

Document any side effects to this change:

* Adjusted some linting rules from the template to allow better
  understanding of what we are skipping and added some skips as
  appropriate (mostly around tests).
* Leaves the templated `my_function` in place as a placeholder. This will
  likely be removed in the near future.
* Introduces files from a Hugging Face model without an automated way
  to update them. Whether this is temporary will depend on how often we
  need to update those files and whether we can find a better way to
  manage them.
@qltysh

qltysh bot commented Mar 6, 2026

All good ✅

Contributor

Copilot AI left a comment


Pull request overview

This PR introduces a new “tokenizer” AWS Lambda that converts an input query string into OpenSearch rank_feature clauses using a locally-vendored OpenSearch/Hugging Face tokenizer + IDF weights (USE-427).

Changes:

  • Add QueryTokenizer + tokenizer_handler Lambda implementation and update the container entrypoint.
  • Vendor OpenSearch model tokenizer artifacts into the repo and document local usage.
  • Add unit tests and SAM assets; update dependencies and lint/Makefile workflows to support the new Lambda.
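
As a rough illustration of the output shape described above — and only an illustration, since the function name, field name, and boost handling here are assumptions rather than the PR's actual implementation — token weights could be mapped into OpenSearch rank_feature clauses like this:

```python
def build_rank_feature_query(token_weights: dict[str, float]) -> dict:
    """Build an OpenSearch bool/should query from token -> weight pairs.

    Hypothetical sketch: the "sparse_embedding" field prefix and the use of
    boost for the IDF weight are illustrative assumptions, not taken from
    the PR's code.
    """
    clauses = [
        {"rank_feature": {"field": f"sparse_embedding.{token}", "boost": weight}}
        for token, weight in token_weights.items()
    ]
    return {"query": {"bool": {"should": clauses}}}


result = build_rank_feature_query({"library": 2.1, "catalog": 1.4})
```

Each token becomes one rank_feature clause, so documents that match more (and more heavily weighted) tokens score higher under the bool/should query.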

Reviewed changes

Copilot reviewed 17 out of 20 changed files in this pull request and generated 8 comments.

Summary per file:

  • lambdas/tokenizer_handler.py: New Lambda handler that validates input and builds an OpenSearch query from token weights.
  • lambdas/query_tokenizer.py: Implements query tokenization + IDF weighting using a local Hugging Face tokenizer.
  • lambdas/config.py: Adds dotenv loading and makes the log level configurable via env var.
  • Dockerfile: Points the Lambda CMD to lambdas.tokenizer_handler.lambda_handler.
  • pyproject.toml: Adds runtime deps (python-dotenv, transformers[torch]) and updates Ruff config.
  • uv.lock: Locks the new ML dependency graph (transformers/torch/tokenizers/etc.) and a dev typing dependency.
  • tests/test_tokenizer_handler.py: Adds unit tests for handler output shape and input validation behavior.
  • tests/test_query_tokenizer.py: Adds unit tests for IDF loading and token-to-weight mapping behavior.
  • tests/sam/template.yaml: Updates the SAM template to define the Tokenizer function and required env vars; adjusts function naming/architecture and environment variables for local runs.
  • tests/sam/event.json: Adds a sample event payload for local SAM invocation.
  • tests/sam/env.json.template: Updates the SAM env template to match the new function name and vars.
  • Makefile: Reworks lint targets around Ruff formatting and adds SAM invoke + console helpers.
  • README.md: Replaces the template README with tokenizer-specific usage and local testing instructions.
  • LICENSE: Replaces the placeholder copyright owner line.
  • .pre-commit-config.yaml: Changes the default stage to pre-push and trims commented-out hooks.
  • opensearch-project/opensearch-neural-sparse-encoding-doc-v3-gte/tokenizer_config.json: Adds the tokenizer config artifact from the model.
  • opensearch-project/opensearch-neural-sparse-encoding-doc-v3-gte/README.md: Documents provenance/license for the vendored model artifacts.
  • .github/dependabot.yml (removed): Removes the Dependabot configuration file.

@ghukill ghukill self-assigned this Mar 11, 2026

@ghukill ghukill left a comment


Despite my incredibly nit-picky review, I think this is looking awesome. It's terse, easy to reason about, and includes some nice new conventions for lambdas.

A few of my comments are silly and ignorable if you'd like (e.g. time.perf_counter() vs time.time()), but the one about the QueryTokenizer._sparse_vector_to_dict() method felt worth a "Comment" PR review.

Happy to discuss and/or recreate local testing, but it feels like the more we can lean into avoiding cast(...) and zip(..., strict=False) the more we might avoid some weird edge case-y bugs.

@JPrevost JPrevost requested a review from ghukill March 12, 2026 18:45
Member

@matt-bernhardt matt-bernhardt left a comment


This isn't a full-blown review, but a request for a specific change I ran into while trying to make sure I could run this locally. The sample commands in the "Running locally without SAM" section need to be updated to use tokenizer_handler instead of my_function.

I've been able to build and start the container, and use it from a separate terminal process, using the updated function name.

With that tweak noted, I've been able to run this locally via Docker and curl, and run the listed make commands successfully. I've also reviewed the contents of the lambdas and opensearch-project folders, and think I could pick this up and develop with it when needed. I think that means I'm comfortable moving forward.


@ghukill ghukill left a comment


Approved! Looking great. Thanks for all the conversation and details in comments.

Left an optional comment re: DEBUG logging, just if helpful.

Comment on lines -58 to -59
logger.debug("Query tokens for OpenSearch:")
logger.debug(json.dumps(query_tokens, indent=2))


As-is, I'm fairly certain (and AI agrees) that the json.dumps(...) bit would be evaluated (work performed) even if a DEBUG log line was not emitted.

But, seeing them removed, I can appreciate how helpful they could be potentially.

I'm fairly certain that % interpolation will lazily evaluate, making it a safe/performant option, but you can't use json.dumps(), e.g.:

logger.debug("Query tokens for OpenSearch: %s", query_tokens)

To include the formatting, you could do something like this:

if logger.isEnabledFor(logging.DEBUG):
    logger.debug("Query tokens for OpenSearch:")
    logger.debug(json.dumps(query_tokens, indent=2))

All 100% optional! Having just finished a PR for DEBUG logging in TIM, it's fresh on the brain. Can appreciate a) that it's helpful, and b) annoying to pay CPU cost even when not used. I think either of these approaches will get both.
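
The lazy-evaluation point above can be demonstrated with a small self-contained script (not code from this PR; the logger name and CountingStr helper are invented for illustration). The stdlib logger defers `%s` substitution until a record is actually going to be handled:

```python
import io
import logging


class CountingStr:
    """Counts how many times it gets formatted, to show when work happens."""

    def __init__(self):
        self.calls = 0

    def __str__(self):
        self.calls += 1
        return "expensive repr"


stream = io.StringIO()
demo_logger = logging.getLogger("lazy_demo")
demo_logger.addHandler(logging.StreamHandler(stream))
demo_logger.propagate = False
demo_logger.setLevel(logging.INFO)  # DEBUG records are filtered out

payload = CountingStr()
demo_logger.debug("Query tokens for OpenSearch: %s", payload)
# The %s substitution never ran: logger.debug() returns before any
# formatting when the DEBUG level is disabled.
assert payload.calls == 0

demo_logger.setLevel(logging.DEBUG)
demo_logger.debug("Query tokens for OpenSearch: %s", payload)
assert payload.calls == 1  # formatted exactly once, by the emitting handler
```

By contrast, `logger.debug(json.dumps(...))` serializes the payload before the level check, which is exactly the cost the `isEnabledFor(logging.DEBUG)` guard avoids.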

Member Author


I'll consider reintroducing them in the future (next time I need them). Thanks for the tips on how to do that safely when I get back to it!

@JPrevost JPrevost requested a review from matt-bernhardt March 12, 2026 20:13
Also adds a test that a missing idf.json fails as expected. This test is not strictly necessary, since we aren't adding logic beyond the default behavior, but it is small enough that it doesn't hurt to confirm that missing individual files within the path also behave as expected (i.e., fail with an error).
Removes looping logic, type casting, and an unsafe non-strict zip.

No changes to the generated output occur; this is purely a better implementation
of the same logic. However, the tests required a change to handle lists instead of the looping behavior.
Member

@matt-bernhardt matt-bernhardt left a comment


Thanks! :shipit:

@JPrevost JPrevost force-pushed the use-427-tokenizer branch from ee5be3d to 9edd37e Compare March 12, 2026 20:20
@JPrevost JPrevost merged commit 9ab6c10 into main Mar 12, 2026
5 checks passed
@JPrevost JPrevost deleted the use-427-tokenizer branch March 12, 2026 20:32
